Author

Kevin Cisneros-Cuevas, Soren Fliegel, Chris Liu, Haoxian Liu

Code
library(tidyverse)
library(gganimate)
library(knitr)
library(gifski)
library(broom)
library(kableExtra)

Previous code from project proposal:

Code
life_expectancy <- read.csv("lex.csv")
gdp_per_capita <- read.csv("gdp_pcap.csv")

clean_life_expectancy <- life_expectancy |> 
  # calculating proportion of missing values for each row
  # source: https://www.statology.org/rowmeans-in-r/
  filter(rowMeans(across(-country, is.na)) <= 0.2) |> 
  # gsub: https://www.statology.org/gsub-r/
  # rename: https://dplyr.tidyverse.org/reference/rename.html
  rename_with(~ gsub("^X", "lex_", .), starts_with("X"))

convert_k <- function(vec){
  has_k <- str_detect(vec, "k")
  vec_numeric <- as.numeric(str_replace(vec, "k", "")) 
  return(vec_numeric * ifelse(has_k, 1000, 1))   
}

clean_gdp_per_capita <- gdp_per_capita |> 
  rename_with(~ gsub("^X", "gdp_per_capita_", .), starts_with("X")) |> 
  # all columns type -> numeric
  mutate(across(-country, convert_k))

long_life_expectancy <- clean_life_expectancy |> 
  pivot_longer(
    cols = starts_with("lex_"), 
    names_to = "year", 
    names_prefix = "lex_", 
    values_to = "life_expectancy"
  )

long_gdp_per_capita <- clean_gdp_per_capita |> 
  pivot_longer(
    cols = starts_with("gdp_per_capita_"), 
    names_to = "year", 
    names_prefix = "gdp_per_capita_", 
    values_to = "gdp_per_capita"
  )

clean_long_df <- long_life_expectancy |> 
  inner_join(long_gdp_per_capita, 
             by = c("country", "year")) 

2.1 Data Visualization

1.The relationship between two quantitative variables you are investigating:

Code
clean_long_df |>
  filter(year == 2017) |>
  ggplot(aes(x = gdp_per_capita, y = life_expectancy)) +
  geom_jitter(alpha = 0.5, size = 1) +
  labs(
    x = "GDP per Capita (2017 US$ PPP)",
    y = "",
    title = "GDP per Capita vs. Life Expectancy in 2017 Globally",
    subtitle = "Life Expectancy (in years)") + 
  theme_linedraw()

In this first above relationship, depicting GDP per Capita and Life Expectancy Globally in 2017, we see a positive relationship between the two variables. At first, as GDP per Capita increases, life expectancy increases dramatically. Then, as GDP per Capita reaches increases further, towards the bound of our graph, the positive relationship is less clear and there are diminishing returns in terms of life expectancy. This pattern is simlar to the graph of a logarithm. This graph shows an intensely unequal world, with GDP per Capita varying greatly for countries of varying development and wealth. Additionally, we see an unequal spread of life expectancy even among wealthier countries, where wealth inequality and various healthcare systems likely come into play.

2. How this relationship (from #1) has changed over time:

Code
clean_long_df$year <- as.numeric(clean_long_df$year)

animated_over_time <- clean_long_df |>
  filter(year <= 2017) |>
  ggplot(aes(x = gdp_per_capita, y = life_expectancy)) +
  geom_jitter(alpha = 0.5, size = 1) +
  labs(
    x = "GDP per Capita (2017 US$ PPP)",
    y = "",
    title = "GDP per Capita vs. Life Expectancy: Year {round(frame_time)}",
    subtitle = "Life Expectancy (in years)") +
  theme_linedraw() +
  transition_time(year) +
  ease_aes("linear")

# Source: https://gganimate.com/
# should show in the "viewer" tab and also when rendered in html
animate(animated_over_time, renderer = gifski_renderer(), nframes = 150, duration = 20)

This animation shows the same graph from above (GDP per Capita vs Life Expectancy) over the years 1800-2017. The value here comes from our ability to see an increasing gap between various country’s wealth levels (and life expectancy as a result), but also in the overall life expectancy rising across the globe. This chart reinforces the idea presented above of ever-increasing wealth inequality globally, with some countries exploding in wealth and others staying nearly the same, all from relatively similar starting points in the year 1800. This animation also shows that the logarithmic pattern is followed with a medium strength.

2.2 Linear Regression

Before running the linear regression, summarising the data to one x value and one y value per country in the years 1939 to 1945 would be a great way to simplify the regression model and ensuring each country is represented consistently. We decided to investigate 1939 to 1945 because these are the years the Second World War occured.

The linear regression contains two variables of interest: average GDP per capitia for each country during the Second World War (explanatory variable: \(X\)) and average life expectancy for each country during the Second World War (response variable: \(Y\)).

Code
options(scipen = 999) # disable scientific notation!
summarised_lm <- clean_long_df |> 
  filter(year %in% 1939:1945) |>
  group_by(country) |> 
  summarize(
    avg_gdp_per_capita = mean(gdp_per_capita, na.rm = TRUE),
    avg_life_expectancy = mean(life_expectancy, na.rm = TRUE)
    ) |> 
  ungroup()

# Get Regression Model:
lm_model <- lm(avg_life_expectancy ~ avg_gdp_per_capita, 
               data = summarised_lm)

From the linear regression model, the population regression model is represented by this equation:

\[ Y = 35.321 + 0.0022X + \epsilon \]

Where \(\epsilon\) ~ \(\textit{N}(0, \sigma)\) accounts for random noise.

2.3 Model Fit

Code
summary_var <- lm_model |> 
  augment() |> 
  # we only need three columns
  select(avg_life_expectancy, .fitted, .resid) |> 
  pivot_longer(cols = c(avg_life_expectancy, .fitted, .resid),
               names_to = "variables",
               values_to = "values"
               ) |> 
  map_at("variables", as.factor) |> 
  bind_cols() |> 
  # change the variable names
  mutate(
    variables = fct_recode(variables, 
                          "Response" = "avg_life_expectancy",
                          "Fitted" = ".fitted",
                          "Residual" = ".resid")
  ) |> 
  # make table fancy
  group_by(variables) |> 
  summarize(variance = round(var(values), 2)) |> 
  kable(col.names = c("Variable Name", "𝛔̂²"),
        caption = "Table 1: Variability of the Regression Model",
        align = "c") |> 
  
  kable_classic(full_width = F, html_font = "Cambria")


summary_var
Table 1: Variability of the Regression Model
Variable Name 𝛔̂²
Fitted 43.29
Residual 51.17
Response 94.46

In Table 1, the estimated variances have been calculated for the predicted life expectancy (fitted values), the residuals, and the actual life expectancy (response values). First, the variance of the response values represents the total amount of variation in life expectancy across observations. Second, we have the variance of the fitted value, which captures how much of the variability in life expectancy is explained by GDP per capita. Third, is the residual variance, which represents the unexplained variability - that is, the portion of life expectancy variation that GDP per capita does not account for.

Assessing the Proportion of Variability Explained by the Model

To determine the proportion of the variability in the response values that was accounted in our model, we would first need to calculate the , which is done by doing this:

\[ R^2 = \frac{\hat{\sigma}^2_{\text{Fitted}}}{\hat{\sigma}^2_{\text{Response}}} \]

Code
R2 <- 43.29 / 94.46  

round(R2 * 100, 2)
[1] 45.83

Based on the result, our model explains about 45.83% of the variability in life expectancy using GDP per capita. This means that 54.17% of variability remains unexplained. With an of 45.83%, the quality of our model is moderate. This suggest that the model is useful but not highly predictive. Although it will give us an insight into the relationship between economic prosperity and life expectancy, it lacks other factors needed for a highly accurate prediction. In other words, GDP per capita is not the sole determining factor for a person’s life expectancy. There are other things to consider as well, such as healthcare access, education, environmental conditions, and government policies.

3.1 Visualizing Simulations from the Model

Code
set.seed(537) # reproducibility 

rand_error <- function(x, mean = 0, sd) {
  x + rnorm(length(x), mean = mean, sd = sd)
}

pred_life <- predict(lm_model)
est_sigma <- sigma(lm_model)
sim_response <- tibble(simulated_response = rand_error(pred_life, sd = est_sigma))

full_data <- summarised_lm |>
  select(avg_gdp_per_capita, avg_life_expectancy) |>
  bind_cols(sim_response) 

full_data |>
  ggplot(aes(x = avg_gdp_per_capita, y = simulated_response)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Simulated Data GDP vs Life Expectancy, 1939 - 1945",
       x = "GDP Per Capita", 
       y = " ",
       subtitle = "Simulated Life Expectancy") +
  theme_linedraw() + 
  xlim(0, 20000) + 
  ylim(0, 100)
full_data |>
  ggplot(aes(x = avg_gdp_per_capita, y = avg_life_expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Observed GDP vs Life Expectancy, 1939 - 1945",
       x = "GDP per Capita", 
       y = "",
       subtitle = "Actual Life Expectancy") +
  theme_linedraw() + 
  xlim(0, 20000) + 
  ylim(0, 100)

3.2 Generating Multiple Predictive Checks